The MaSuRCA genome assembler

Authors

  • Aleksey V. Zimin
  • Guillaume Marçais
  • Daniela Puiu
  • Michael Roberts
  • Steven Salzberg
  • James A. Yorke

Abstract

MOTIVATION: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka').

RESULTS: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.

AVAILABILITY: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year.

CONTACT: [email protected].

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
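
The abstract describes turning many short reads into fewer, longer 'super-reads' using de Bruijn graph ideas, but it does not give the procedure. The following is only a minimal sketch of one plausible reading: each read is extended base by base for as long as the k-mer set supports exactly one continuation. The function names (count_kmers, extend_right, extend_left, super_read), the value k = 4, and the toy reads are all assumptions made for illustration. The sketch ignores reverse complements, sequencing errors, and mate-pair information, and it is not the MaSuRCA implementation.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seq, kmers, k):
    """Append bases while exactly one next k-mer is present in `kmers`."""
    seen = {seq[-k:]}  # k-mers visited during extension (guards against cycles)
    while True:
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if suffix + b in kmers]
        if len(candidates) != 1:
            break  # dead end (0 candidates) or branch (2 or more)
        next_kmer = suffix + candidates[0]
        if next_kmer in seen:
            break  # stop instead of looping around a repeat
        seen.add(next_kmer)
        seq += candidates[0]
    return seq


def extend_left(seq, kmers, k):
    """Prepend bases while exactly one previous k-mer is present in `kmers`."""
    seen = {seq[:k]}
    while True:
        prefix = seq[:k - 1]
        candidates = [b for b in "ACGT" if b + prefix in kmers]
        if len(candidates) != 1:
            break
        prev_kmer = candidates[0] + prefix
        if prev_kmer in seen:
            break
        seen.add(prev_kmer)
        seq = candidates[0] + seq
    return seq


def super_read(read, kmers, k):
    """Extend a read in both directions until the extension is ambiguous."""
    return extend_left(extend_right(read, kmers, k), kmers, k)


if __name__ == "__main__":
    # Toy data: three overlapping reads drawn from one short sequence.
    reads = ["ACGTTGC", "TTGCATG", "CATGGA"]
    k = 4
    kmers = count_kmers(reads, k)
    # Reads that extend to the same string collapse into one super-read.
    print({super_read(r, kmers, k) for r in reads})
```

In this toy case the three reads all extend to the same string, which illustrates how a large read set could shrink to a much smaller set of longer sequences. A branch in the k-mer set (two possible next bases) stops the extension, so repeats would yield shorter super-reads.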

Similar resources

Genome assembly and transcriptome resource for river buffalo, Bubalus bubalis (2n = 50)

Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reporte...

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm.

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data...

MERmaid: A Parallel Genome Assembler for the Cloud

Modern genome sequencers are capable of producing millions to billions of short reads of DNA. Each new generation of genome sequencers is able to provide an order of magnitude more data than the previous, resulting in an exponential increase in required data processing throughput. The challenge today is to build a software genome assembler that is highly parallel, fast, and inexpensive to run. ...

Parallelization of MIRA Whole Genome and EST Sequence Assembler

The genome assembly problem is to generate the original DNA sequence of the organism from a large set of short overlapping fragments. MIRA is an open source assembler based on the Overlap Layout Consensus (OLC) graph model which addresses the assembly problem and is widely used by biologists [1,2]. Like other assemblers MIRA takes a long time to compute the assembly for large number of sequence...

Determination of Material Flows in a Multi-echelon Assembly Supply Chain

This study aims to minimize the total cost of a four-echelon supply chain including suppliers, an assembler, distributers, and retailers. The total cost consists of purchasing raw materials from the suppliers by the assembler, assembling the final product, materials transportation from the suppliers to the assembler, product transportation from the assembler to the distributors, product transpo...

Journal:
  • Bioinformatics

Volume 29, Issue 21

Pages: -

Publication date: 2013